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Abstract 

Statistical models of word-sense disam- 
biguation are often based on a small num- 
ber of contextual features or on a model 
that is assumed to characterize the inter- 
actions among a set of features. Model 
selection is presented as an alternative to 
these approaches, where a sequential search 
of possible models is conducted in order to 
find the model that best characterizes the 
interactions among features. This paper 
expands existing model selection method- 
ology and presents the first comparative 
study of model selection search strategies 
and evaluation criteria when applied to the 
problem of building probabilistic classifiers 
for word-sense disambiguation. 

1 Introduction 

In this paper word-sense disambiguation is cast as 
a problem in supervised learning, where a classifier 
is induced from a corpus of sense-tagged text. Sup- 
pose there is a training sample where each sense- 
tagged sentence is represented by the feature vari- 
ables (Fi, . . . , F n _i, S). Selected contextual proper- 
ties of the sentence are represented by (Fi, . . . , -F n -i) 
and the sense of the ambiguous word is represented 
by S. Our task is to induce a classifier that will 
predict the value of 5 given an untagged sentence 
represented by the contextual feature variables. 

We adopt a statistical approach whereby a prob- 
abilistic model is selected that describes the inter- 
actions among the feature variables. Such a model 
can form the basis of a probabilistic classifier since 
it specifies the probability of observing any and all 
combinations of the values of the feature variables. 
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Suppose our training sample has N sense-tagged 
sentences. There are q possible combinations of val- 
ues for the n feature variables, where each such com- 
bination is represented by a feature vector. Let 
fi and Oi be the frequency and probability of ob- 
serving the i th feature vector, respectively. Then 
(/l) • • • j fq) has a multinomial distribution with pa- 
rameters (N, $i,..., q ). The 9 parameters, i.e., the 
joint parameters, define the joint probability distri- 
bution of the feature variables. These are the pa- 
rameters of the fully saturated model, the model in 
which the value of each variable directly affects the 
values of all the other variables. These parameters 
can be estimated as maximum likelihood estimates 
(MLEs), such that the estimate of 9%, 9%, is 4l. 

For these estimates to be reliable, each of the q 
possible combinations of feature values must occur 
in the training sample. This is unlikely for NLP data 
samples, which are often sparse and highly skew ed 
(c.f., e.g. (|Pedersen et al., 1996|) and flZipf. ' 1935| )). 

However, if the data sample can be adequately 
characterized by a less complex model, i.e., a model 
in which there are fewer interactions between vari- 
ables, then more reliable parameter estimates can be 
obtained: In the case of decomposable models (Dar- 
roch et al., 1980; see below), the parameters of a less 
complex model are parameters of marginal distribu- 
tions, so the MLEs involve frequencies of combina- 
tions of values of only subsets of the variables in the 
model. How well a model characterizes the train- 
ing sample is determined by measuring the fit of the 
model to the sample, i.e., how well the distribution 
defined by the model matches the distribution ob- 
served in the training sample. 

A good strategy for developing probabilistic clas- 
sifiers is to perform an explicit model search to se- 
lect the model to use in classification. This pa- 
per presents the results of a comparative study of 
search strategies and evaluation criteria for measur- 
ing model fit. We restrict the selection process to the 



class of decomposable models ( Darrochct al., 1980 ), 
since restricting model search to this class has many 
computational advantages. 

We begin with a short description of decompos- 
able models (in section 2). Search strategies (in sec- 
tion 3) and model evaluation (in section 4) are de- 
scribed next, followed by the results of an extensive 
disambiguation experiment involving 12 ambiguous 
words (in sections 5 and 6) . We discuss related work 
(in section 7) and close with recommendations for 
search strategy and evaluation criterion when select- 
ing models for word-sense disambiguation. 

2 Decomposable Models 

Decomposable models are a subset of the class 



feature variables, it has been shown that they 
have substantially the same expressive power 
( |Whittaker, 1990[ ). 



of graphical models (Whittaker, 1990) which are 
i n turn a subset of the class of log-linear models 
( Bishop et al., 1975 ). Familiar examples of decom- 
posable models are Naive Bayes and n-gram models. 
They are characterized by the following properties 
( |Brucc and Wiebe, 1994b] ): 



1. In a graphical model, variables are either inter- 
dependent or conditionally independent of one 
another. [J All graphical models have a graphi- 
cal representation such that each variable in the 
model is mapped to a node in the graph, and 
there is an undirected edge between each pair 
of nodes corresponding to interdependent vari- 
ables. The sets of completely connected nodes 
(i.e., cliques) correspond to sets of interdepen- 
dent variables. Any two nodes that are not di- 
rectly connected by an edge are conditionally 
independent given the values of the nodes on 
the path that connects them. 

2. Decomposable models are those graphical mod- 
els that express the joint distribution as the 
product of the marginal distributions of the 
variables in the maximal cliques of the graphical 
representation, scaled by the marginal distribu- 
tions of variables common to two or more of 
these maximal sets. Because their joint distri- 
butions have such closed-form expressions, the 
parameters can be estimated directly from the 
training data without the need for an iterative 
fitting procedure (as is required, for example, to 
estimate the parameters of maximum entropy 
models; (Berger ct al., 1996)). 



3. Although there are far fewer decomposable 
models than log-linear models for a given set of 



The joint parameter estimate Qfl'fa'fesi IS tnc 
probability that the feature vector (/i, J2, /3, Si) will 
be observed in a training sample where each ob- 
servation is represented by the feature variables 
(Fi ,F 2 ,F 3 ,S). Suppose that the graphical represen- 
tation of a decomposable model is defined by the two 
cliques (i.e., marginals) (Fi,S) and (F2,F 3 ,S). The 
frequencies of these marginals, f(F\ = fi,S = Si) 
and f(F 2 = f2,F 3 = f 3 ,S = s 4 ), are sufficient 
statistics in that they provide enough information 
to calculate maximum likelihood estimates of the 
model parameters. MLEs of the model parameters 
are simply the marginal frequencies normalized by 
the sample size N. The joint parameter estimate is 
formulated as a normalized product: 



JFi, F 2 ,F 3 ,S 



f(F 1 = f 1 ,S=s i ) f(F 2 =f 2 ,F 3 =f3,S= Si 
N N 

f(S=8j) 

N 



1 F2 and Fg are conditionally independent given S if 

p(F 2 \F 5 ,S)=p(F2\S). 



(1) 

Rather than having to observe the complete fea- 
ture vector (/i, /2, /3, Si) in the training sample to 
estimate the joint parameter, it is only necessary to 
observe the marginals (/i, s») and {f%, fs, Si). 

3 Model Search Strategies 

The search strategies presented in this paper are 
backward sequential search (BSS) and forward se- 
quential search (FSS). Sequential searches evaluate 
models of increasing (FSS) or decreasing (BSS) lev- 
els of complexity, where complexity is defined by the 
number of interactions among the feature variables 
(i.e., the number of edges in the graphical represen- 
tation of the model). 

A backward sequential search (BSS) begins by 
designating the saturated model as the current 
model. A saturated model has complexity level 
i = r -^\ — i w here n is the number of feature vari- 
ables. At each stage in BSS we generate the set of 
decomposable models of complexity level i — 1 that 
can be created by removing an edge from the cur- 
rent model of complexity level i. Each member of 
this set is a hypothesized model and is judged by 
the evaluation criterion to determine which model 
results in the least degradation in fit from the cur- 
rent model — that model becomes the current model 
and the search continues. The search stops when ei- 
ther (1) every hypothesized model results in an un- 
acceptably high degradation in fit or (2) the current 
model has a complexity level of zero. 



A forward sequential search (FSS) begins by des- 
ignating the model of independence as the current 
model. The model of independence has complexity 
level i = since there are no interactions among the 
feature variables. At each stage in FSS we generate 
the set of decomposable models of complexity level 
i + 1 that can be created by adding an edge to the 
current model of complexity level i. Each member of 
this set is a hypothesized model and is judged by the 
evaluation criterion to determine which model re- 
sults in the greatest improvement in fit from the cur- 
rent model — that model becomes the current model 
and the search continues. The search stops when 
either (1) every hypothesized model results in an 
unacceptably small increase in fit or (2) the current 
model is saturated. 

For sparse samples FSS is a natural choice since 
early in the search the models are of low complexity. 
The number of model parameters is small and they 
have more reliable estimated values. On the other 
hand, BSS begins with a saturated model whose pa- 
rameter estimates are known to be unreliable. 

During both BSS and FSS, model selection also 
performs feature selection. If a model is selected 
where there is no edge connecting a feature variable 
to the classification variable then that feature is not 
relevant to the classification being performed. 

4 Model Evaluation Criteria 

Evaluation criteria fall into two broad classes, signifi- 
cance tests and information criteria. This paper con- 
siders two significance tests, the exact conditional 



test (Krcincr, 1987) and the Log-likelihood ratio 
statistic G 2 (Bishop et al., 1975), and two informa- 



tion criteria, Akaike's Information Criterion (AIC) 
(Akaike, 1974) and the Bayesian Information Crite- 
rion (BIC) flSchwarz, 1978|). 



4.1 Significance tests 

The Log-likelihood ratio statistic G 2 is defined as: 



G 2 = E/iX^T 



(2) 



where fi and e% are the observed and expected counts 
of the i th feature vector, respectively. The observed 
count fi is simply the frequency in the training sam- 
ple. The expected count ei is calculated from the 
frequencies in the training data assuming that the 
hypothesized model, i.e., the model generated in the 
search, adequately fits the sample. The smaller the 
value of G 2 the better the fit of the hypothesized 
model. 



The distribution of G 2 is asymptotically approx- 
imated by the \ 2 distribution (G 2 ~ \ 2 ) "with ad- 
justed degrees of freedom (dof) equal to the number 
of model parameters that have non-zero estimates 
given the training sample. The significance of a 
model is equal to the probability of observing its 
reference G 2 in the \ 2 distribution with appropriate 
dof. A hypothesized model is accepted if the signif- 
icance (i.e., probability) of its reference G 2 value is 
greater than, in the case of FSS, or less than, in the 
case of BSS, some pre-determined cutoff, a. 

An alternative to using a x 2 approximation is to 
define the exact conditional distribution of G 2 . The 
exact conditional distribution of G 2 is the distribu- 
tion of G 2 values that would be observed for com- 
parable data samples randomly generated from the 
model being tested. The significance of G 2 based on 
the exact conditional distribution does not rely on an 
asymptotic approximation and is accurate for sp arse 
and skewed data samples ( Pedersenet al., 1996 ). 



4.2 Information criteria 

The family of model evaluation criteria known as 
information criteria have the following expression: 



IC K = G — k x dof 



(3) 



where G 2 and dof are defined above. Members of 
this family are distinguished by their different values 
of k. AIC corresponds to K = 2. BIC corresponds 
to k — log(N), where N is the sample size. 

The various information criteria are an alterna- 
tive to using a pre-defined significance level (a) to 
judge the acceptability of a model. AIC and BIC re- 
ward good model fit and penalize models with large 
numbers of parameters. The parameter penalty is 
expressed as k x dof, where the size of the penalty 
is the adjusted degrees of freedom, and the weight 
of the penalty is controlled by n. 

During BSS the hypothesized model with the 
largest negative IC K value is selected as the cur- 
rent model of complexity level i — 1, while during 
FSS the hypothesized model with the largest pos- 
itive IC K value is selected as the current model of 
complexity level i + 1. The search stops when the 
IC K values for all hypothesized models are greater 
than zero in the case of BSS, or less than zero in the 
case of FSS. 

5 Experimental Data 

The sense-tagged text and feature set used in 



these experiments are the same as in ( Bruce et al., 
|1996| ). The text consists of every sentence from the 
ACL/DCI Wall Street Journal corpus that contains 



any of the nouns interest, bill, concern, and drug, 
any of the verbs close, help, agree, and include, or 
any of the adjectives chief, public, last, and common. 

The extracted sentences have been hand-tagged 
with senses defined in the Longman Dictionary of 
Contemporary English (LDOCE). There are be- 
tween 800 and 3,000 sense-tagged sentences for each 
of the 12 words. This data was randomly divided 
into training and test samples at a 10:1 ratio. 

A sentence with an ambiguous word is represented 
by a feature set with three types of contextual fea- 
ture variables:!] (1) The morphological feature (E) 
indicates if an ambiguous noun is plural or not. For 
verbs it indicates the tense of the verb. This feature 
is not used for adjectives. (2) The POS features 
have one of 25 possible POS tags, derived from the 
first letter of the tags in the ACL/DCI WSJ cor- 
pus. There are four POS feature variables repre- 
senting the POS of the two words immediately pre- 
ceding (L\,L-i) and following (i?i,i? 2 ) the ambigu- 
ous word. (3) The three binary collocation-specific 
features (C\, Ci, Cj.) indicate if a particular word oc- 
curs in a sentence with an ambiguous word. 

The sparse nature of our data can be illustrated 
by interest. There are 6 possible values for the sense 
variable. Combined with the other feature variables 
this results in 37,500,000 possible feature vectors (or 
joint parameters). However, we have a training sam- 
ple of only 2,100 instances. 

6 Experimental Results 

In total, eight different decomposable models were 
selected via a model search for each of the 12 words. 
Each of the eight models is due to a different com- 
bination of search strategy and evaluation criterion. 
Two additional classifiers were evaluated to serve as 
benchmarks. The default classifier assigns every in- 
stance of an ambiguous word with its most frequent 
sense in the training sample. The Naive Bayes clas- 
sifier uses a model that assumes that each contex- 
tual feature variable is conditionally independent of 
all other contextual variables given the value of the 
sense variable. 

6.1 Accuracy comparison 

The accuracy^ of each of these classifiers for each 
of the 12 words is shown in Figure [j]. The highest 
accuracy for each word is in bold type while any ac- 
curacies less than the default classifier are italicized. 



2 An alternative feature set for this data is utilized 
with an e xemplar-based learning algorithm in ( Ng and 



Lee, 1996) 

3 



The percentage of ambiguous words in a held out 
test sample that are disambiguated correctly. 



The complexity of the model selected is shown in 
parenthesis. For convenience, we refer to model se- 
lection using, for example, a search strategy of FSS 
and the evaluation criterion AIC as FSS AIC. 

Overall AIC selects the most accurate models dur- 
ing both BSS and FSS. BSS AIC finds the most ac- 
curate model for 6 of 12 words while FSS AIC finds 
the most accurate for 4 of 12 words. BSS BIC and 
the Naive Bayes find the most accurate model for 3 
of 12 words. Each of the other combinations finds 
the most most accurate model for 2 of 12 words ex- 
cept for FSS exact conditional which never finds the 
most accurate model. 

Neither AIC nor BIC ever selects a model that 
results in accuracy less than the default classifier. 
However, FSS exact conditional has accuracy less 
than the default for 6 of 12 words and BSS exact 
conditional has accuracy less than the default for 3 
of 12 words. BSS G 2 ~ X 2 and FSS G 2 ~ X 2 have 
less than default accuracy for 2 of 12 and 1 of 12 
words, respectively. 

The accuracy of the significance tests vary greatly 
depending on the choice of a. Of the various a values 
that were tested, .01, .05, .001, and .0001, the value 
of .0001 was found to produce the most accurate 
models. Other values of a will certainly led to other 
results. The information criteria do not require the 
setting of any such cut-off values. 

A low complexity model that results in high accu- 
racy disambiguation is the ultimate goal. Figure [fl 
shows that BIC and G 2 ~ \ 2 select lower complexity 
models than either AIC or the exact conditional test. 
However, both appear to sacrifice accuracy when 
compared to AIC. BIC assesses a greater parame- 
ter penalty (k = log(N)) than does AIC (k = 2), 
causing BSS BIC to remove more interactions than 
BSS AIC. Likewise, FSS BIC adds fewer interactions 
than FSS AIC. In both cases BIC selects models 
whose complexity is too low and adversely affects 
accuracy when compared to AIC. 

The Naive Bayes classifier achieves a high level 
of accuracy using a model of low complexity. In 
fact, while the Naive Bayes classifier is most accu- 
rate for only 3 of the 12 words, the average accu- 
racy of the Naive Bayes classifiers for all 12 words 
is higher than the average classification accuracy re- 
sulting from any combination of the search strategies 
and evaluation criteria. The average complexity of 
the Naive Bayes models is also lower than the av- 
erage complexity of the models resulting from any 
combination of the search strategies and evaluation 
criteria except BSS BIC and FSS BIC. 
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Figure 3: Effect of Search Strategy 



6.2 Search strategy and accuracy 

An evaluation criterion that finds models of simi- 
lar accuracy using either BSS or FSS is to be pre- 
ferred over one that does not. Overall the infor- 
mation criteria are not greatly affected by a change 
in the search strategy, as illustrated in Figure ||. 
Each point on this plot represents the accuracy of 
the models selected for a word by the same evalua- 
tion criterion using BSS and FSS. If this point falls 
close to the line BSS = FSS then there is little 
or no difference between the accuracy of the models 
selected during FSS and BSS. 

AIC exhibits only minor deviation from BSS = 
FSS. This is also illustrated by the fact that the 
average accuracy between BSS AIC and FSS AIC 
only differs by .0013. The significance tests, espe- 
cially the exact conditional, are more affected by 
the search strategy. It is clear that BSS exact condi- 
tional is much more accurate than FSS exact condi- 
tional. FSS G 2 ~ x 2 is slightly more accurate than 



BSSG 2 



X 



6.3 Feature selection: interest 

Figure shows the models selected by the various 
combinations of search strategy and evaluation cri- 
terion for interest. 

During BSS, AIC removed feature L2 from the 
model, BIC removed L 1 ,L 2 ,Ri and i? 2 , G 2 ~ x 2 
removed no features, and the exact conditional test 
removed C^. During FSS, AIC never added i? 2 , BIC 
never added C\, C3, Li, L 2 and i? 2 , and G 2 ~ x 2 an d 



the exact conditional test added all the features. 

G 2 ~ x 2 is the most consistent of the evaluation 
criteria in feature selection. During both BSS and 
FSS it found that all the features were relevant to 
classification. 

AIC found seven features to be relevant in both 
BSS and FSS. When using AIC, the only difference 
in the feature set selected during FSS as compared 
to that selected during BSS is the part of speech 
feature that is found to be irrelevant: during BSS L 2 
is removed and during FSS i? 2 is never added. All 
other criteria exhibit more variation between FSS 
and BSS in feature set selection. 

6.4 Model selection: interest 

Here we consider the results of each stage of the 
sequential model selection for interest. Figures H 
through [7] show the accuracy and recallF] for the 
best fitting model at each level of complexity in the 
search. The rightmost point on each plot for each 
evaluation criterion is the measure associated with 
the model ultimately selected. 

These plots illustrate that BSS BIC selects mod- 
els of too low complexity. In Figure [| BSS BIC has 
"gone past" much more accurate models than the 
one it selected. We observe the related problem for 
FSS BIC. In Figure | FSS BIC adds too few in- 
teractions and does not select as accurate a model 
as FSS AIC. The exact conditional test suffers from 
the reverse problem of BIC. BSS exact conditional 
removes only a few interactions while FSS exact con- 
ditional adds many interactions, and in both cases 
the resulting models have poor accuracy. 

The difference between BSS and FSS is clearly il- 
lustrated by these plots. AIC and BIC eliminate in- 
teractions that have high dof 's (and thus have large 
numbers of parameters) much earlier in BSS than 
the significance tests. This rapid reduction in the 
number of parameters results in a rapid increases 
in accuracy (Figure 0) and recall for AIC and BIC 
(Figure 0) relative to the significance tests as they 
produce models with smaller numbers of parameters 
that can be estimated more reliably. 

However, during the early stages of FSS the num- 
ber of parameters in the models is very small and the 
differences between the information criteria and the 
significance tests are minimized. The major differ- 
ence among the criteria in Figures @ and R is that the 
exact conditional test adds many more interactions. 



4 The percentage of ambiguous words in a held out test 
sample that are disambiguated, correctly or not. A word 
is not disambiguated if the model parameters needed to 
assign a sense tag cannot be estimated from the training 
sample. 



7 Related Work 

Statistical analysis of NLP data has often been lim- 
ited to the application of standard models, such 
as n-gram (Markov chain) models and the Naive 
Bayes model. While n-grams perform well in part- 
of-speech tagging and speech processing, they re- 
quire a fixed interdependency structure that is inap- 
propriate for the broad class of contextual features 
used in word-sense disambiguation. However, the 
Naive Bayes classifier has been found to perform 
well for word-sense disambiguation both here and 
in a variety of other works (e. g., (|Bruce and Wiebe 
1994a|), (pale et al., 199^), QLeacock et al., 1993|) 



and ( |Mooney, 1996| )). 



In order to utilize models with more complicated 
interactions among feature variables, (Bruce and 
Wiebe, 1994b| ) introduce the use of sequential model 



selection and decomposable models for word-sense 
disambiguation!] 

Alternative probabilistic approaches have involved 
using a single co ntextual feature to p erform disam 
biguation (e .g., (Brown et al., 1991 ), ( Pagan et al . 



1991 ), and ( Yarowsky, 1993 ) present techniques for 



identifying the optimal feature to use in disambigua- 
tion). Maximum Entropy models have been used to 
express the interactions among multiple feature vari- 
ables (e.g., ( Bcrgcr ct al., 1996 )), but within this 
framework no systematic study of interactions has 
been proposed. Decision tree induction ha s been 
applied to word-sense disambiguation (e.g. (Black. 
1988| ) and ( |Mooney, 1996J )) but, while it is a type of 



model selection, the models are not parametric. 

8 Conclusion 

Sequential model selection is a viable means of 
choosing a probabilistic model to perform word- 
sense disambiguation. We recommend AIC as the 
evaluation criterion during model selection due to 
the following: 

1. It is difficult to set an appropriate cutoff value 
(a) for a significance test. 

2. The information criteria AIC and BIC are more 
robust to changes in search strategy. 

3. BIC removes too many interactions and results 
in models of too low complexity. 



5 They recommended a model selection procedure us- 
ing BSS and the exact conditional test in combination 
with a test for model predictive power. In their proce- 
dure, the exact conditional test was used to guide the 
generation of new models and the test of model predic- 
tive power was used to select the final model from among 
those generated during the search. 



The choice of search strategy when using AIC is 
less critical than when using significance tests. How- 
ever, we recommend FSS for sparse data (NLP data 
is typically sparse) since it reduces the impact of very 
high degrees of freedom and the resultant unreliable 
parameter estimates on model selection. 

The Naive Bayes classifier is based on a low com- 
plexity model that is shown to lead to high accuracy. 
If feature selection is not in doubt (i.e., it is fairly 
certain that all of the features are somehow relevant 
to classification) then this is a reasonable approach. 
However, if some features are of questionable value 
the Naive Bayes model will continue to utilize them 
while sequential model selection will disregard them. 

All of the search strategies and evaluation crite- 
ria discussed are implemented in the public domain 
program CoCo ( Badsberg, 1995| ). 
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Figure 1: Accuracy comparison 
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Figure 2: Models selected: interest 
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Figure 4: BSS accuracy: interest 



Figure 6: FSS accuracy: interest 
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Figure 5: BSS recall: interest 



Figure 7: FSS recall: interest 



